Fault-tolerant finite-element multigrid algorithms with hierarchically compressed asynchronous checkpointing

Authors

  • Dominik Göddeke
  • Mirco Altenbernd
  • Dirk Ribbrock
Abstract

We examine novel fault-tolerance schemes for data loss in multigrid solvers that combine ideas from checkpoint-restart with algorithm-based fault tolerance. To improve efficiency compared to conventional global checkpointing, we exploit the inherent data compression of the multigrid hierarchy and relax the synchronicity requirement through a local-failure local-recovery approach. We experimentally identify the root cause of convergence degradation in the presence of data loss using smoothness considerations. Our resulting schemes form a family of techniques that can be tailored to the expected error probability of (future) large-scale machines. A performance model gives further insight into the benefits and applicability of our techniques.
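
The idea of using the grid hierarchy as a compressed checkpoint, with recovery restricted to the lost data, can be illustrated with a toy example. The following is a minimal sketch under stated assumptions, not the authors' scheme: it uses a 1D grid, injection (keeping every 2^levels-th value) as the compression step, and linear interpolation as the recovery step. The function names `compress`, `decompress`, and `recover` are hypothetical and chosen only for this illustration.

```python
import numpy as np


def compress(u, levels):
    """Keep every 2**levels-th value (injection onto a coarser grid)."""
    return u[:: 2 ** levels].copy()


def decompress(coarse, n_fine):
    """Linearly interpolate a coarse checkpoint back to n_fine points on [0, 1]."""
    x_fine = np.linspace(0.0, 1.0, n_fine)
    x_coarse = np.linspace(0.0, 1.0, coarse.size)
    return np.interp(x_fine, x_coarse, coarse)


def recover(u_damaged, lost, checkpoint):
    """Overwrite only the lost entries with the interpolated checkpoint.

    Entries outside `lost` are left untouched (local recovery).
    """
    repaired = u_damaged.copy()
    repaired[lost] = decompress(checkpoint, u_damaged.size)[lost]
    return repaired


# Toy setup (all values are assumptions for illustration).
n = 2 ** 8 + 1                        # 257 fine-grid points
x = np.linspace(0.0, 1.0, n)
u = np.sin(np.pi * x)                 # a smooth "solution" vector

checkpoint = compress(u, levels=2)    # 65 values instead of 257

lost = slice(96, 160)                 # pretend this block was lost
damaged = u.copy()
damaged[lost] = 0.0

repaired = recover(damaged, lost, checkpoint)
max_error = np.max(np.abs(repaired - u))
print(checkpoint.size, max_error)
```

In this sketch the checkpoint holds roughly a quarter of the fine-grid values, and the repaired vector differs from the original only by the interpolation error inside the lost block. That error is small here because the vector is smooth; a vector with strong high-frequency components would be recovered less accurately from the same checkpoint.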

Related resources

A Consensus-Based Fault-Tolerant Event Logger for High Performance Applications

High-performance computing (HPC) systems traditionally employ rollback-recovery techniques to allow fault-tolerant executions of parallel applications. Rollback-recovery based on message logging is an attractive strategy that avoids the drawbacks of coordinated checkpointing in systems with low mean time between failures (MTBF). Most message logging protocols rely on a centralized event logger t...

Modeling of Hierarchical Distributed Systems with Fault-Tolerance

Abstract: This paper addresses some fault-tolerance issues pertaining to hierarchically distributed systems. Since each of the levels in a hierarchical system could have various characteristics, different fault-tolerance schemes could be appropriate at different levels. In this paper, we use stochastic Petri nets (SPNs) to investigate various fault-tolerant schemes in this context. The basic ...

Asynchronous parallel solvers for linear systems arising in computational engineering

Modern trends in Computational Science and Engineering are moving towards the use of computer systems with ever increasing numbers of computational cores. A consequence of this is that over the next decade it will be necessary to develop and apply new numerical algorithms that are far more scalable than has historically been required. Ideally, such algorithms will be able to exploit many thousa...

A Scalable Asynchronous Replication-Based Strategy for Fault Tolerant MPI Applications

As computational clusters increase in size, their mean time to failure decreases. Typically, checkpointing is used to minimize the loss of computation. Most checkpointing techniques, however, require central storage for checkpoints, which severely limits their scalability. We propose a scalable replication-based MPI checkpointing facility that is based on LAM/MPI. We extend...

Metapromela: A Toolkit for Simulation of Checkpointing Algorithms

Distributed checkpointing algorithms play an important role in the majority of the fault tolerant software components existent today. Unfortunately, there is a lack of comprehensive and uniform performance testing of those algorithms. Our research focuses on the provision of a toolkit, Metapromela, that helps with the implementation and testing of distributed checkpointing algorithms. This pape...

Journal:
  • Parallel Computing

Volume 49  Issue 

Pages  -

Publication date 2015